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With  the  advances  in  VLSI  technology,  there  is  tremendous  deaand 
for  efficient  circnit  simulators  and  other  computer-aided  design 
tools.  One  of  the  factors  liaiting  the  design  time  of  VLSI  circuits 
is  the  slowness  of  circuit  simulation  programs.  Conventional  circuit 
simulators  ,  for  example,  SPICE  [18],  were  designed  initially  for  the 
cost-effective  analysis  of  circuits  containing  a  few  hundred  transis¬ 
tors  or  less.  Because  of  the  need  to  verify  the  performance  of 
larger  circuits,  many  users  have  used  programs  like  SPICE  [18]  and 
have  successfully  simulated  circuits  containing  thousands  of  transis¬ 
tors  despite  the  cost.  Because  these  circuit  simulation  programs  are 
slow,  designers  are  forced  to  use  less  accurate  models  and  to  make 
assumptions  during  the  simulations  in  order  to  save  computation  cost. 
At  the  circuit  level,  the  electrical  behavior  of  the  design  is 
modelled  in  terms  of  algebraic-differential  equations.  The  use  of 
implicit  integration  methods,  together  with  modified  Newton  iterative 
methods  for  solving  the  algebraic-differential  equations  representing 
the  circuit  model,  has  been  found  to  give  reliable  numerical  solu¬ 
tions,  and  has  thus  been  implemented  in  many  circuit  simulators,  such 
as  SPICE  [18]. 

A  number  of  approaches  have  been  used  to  improve  the  performance 
of  conventional  circuit  simulators  for  the  analysis  of  large  cir¬ 
cuits.  The  time  required  to  evaluate  complex  device  model  equations 
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has  been  reduced  by  using  table-lookup  models  [19].  Techniques  based 
on  special  purpose  microcodes  hare  been  investigated  for  reducing  the 
time  required  to  solve  sparse  linear  systems  arising  frcat  the  linear¬ 
ization  of  the  circuit  equation  [20].  Node  tearing  techniques  [5,15] 
have  also  been  used  to  exploit  circuit  regularity  by  bypassing  the 
solution  of  subcircnits  whose  state  is  not  changing  and  to  exploit 
the  vector-processing  capabilities  of  high  performance  computers  such 
as  the  Cray-1  used  in  the  simulation  program  (LASSIE  [21]  .  The 
advent  of  VLSI  technology  has  made  the  cost-effective  design  of  spe¬ 
cial  purpose  machines  possible.  Examples  of  these  machines  are  the 
Torktown  Simulation  Engine  (TSE)  for  logic  simulation  [22],  systolic 
arrays  [4]  and  the  wavefront  array  processors  [2].  Special  purpose 
machines  have  also  been  proposed  for  the  solution  of  linear  systems 
of  equations  [2].  Most  of  these  machines  limit  the  size  of  the 
operand  matrix  except  for  the  one  designed  by  Pottle  [7].  Special 
matrix  structures  such  as  the  Bordered  Block  Diagonal  Form  (BBDF)  are 
discussed  in  Blossom  [1]. 

The  procedure  involved  in  a  standard  circuit  simulation  is  shown 
in  Fig.  1  [16].  The  majority  of  the  time  spent  to  run  a  circuit 

simulation  can  be  lumped  into  two  categories:  the  time  required  to 
solve  the  system  of  sparse  linear  equations,  SGLVE  (steps(7)  and  (8)) 
and  the  time  required  to  form  the  entries  of  A  and  b  in  Ax*>b,  FORM 
(steps(5)  and  (6)).  These  two  steps  are  repeated  over  and  over 
again.  As  seen  in  Fig.  2,  for  small  circuits  (N  <  20),  the  majority 
of  the  solution  time  is  spent  performing  FORM.  However,  when  the 
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Fig.  2.  Transient  analysis  time  for  circuits  of  increasing  size  [16] 


size  of  the  circuit  grows,  sn  increasing  percentage  of  the  tine  is 
spent  in  the  SGLVE  phase  for  all  standard  circuit  simulators  running 
on  conventional  computers.  This  thesis  proposes  a  design  of  a  spe¬ 
cial  parallel  processor  to  be  used  in  performing  the  steps  involved 
in  SGLVE. 

There  are  two  methods  for  solving  large  systems  of  linear  equa¬ 
tions.  One  method  is  to  use  LIT  factorization,  forward  and  backward 
substitutions  on  the  entire  matrix  to  arrive  at  the  solutions  in  one 
pass.  The  other  method  is  to  use  an  indirect  relaxation  method  where 
certain  solutions  are  guessed  at  on  the  first  iteration.  The  new 
solutions  are  then  found  and  the  process  is  repeated  until  the  solu¬ 
tions  converge.  For  most  circuits,  the  fraction  of  nodes  which  are 
changing  their  voltage  values  at  a  given  point  in  time  decreases  as 
the  circuit  size  increases.  For  circuits  containing  over  500  MDS- 
FETS ,  fewer  than  20  percent  of  the  node  voltages  change  significantly 
over  a  simulation  time  step.  Circuit  simulators  exploit  this  time 
sparsity  or  latency  by  using  device-level  or  block-level  bypass 
schemes.  In  a  device-level  bypass  scheme,  if  the  terminal  voltages 
and  branch  currents  of  a  circuit  element  do  not  change  significantly 
from  the  previous  iteration,  its  contributions  to  A  and  b  in  Ax>=b  are 
not  re-evaluated,  and  the  values  computed  during  the  previous  itera¬ 
tion  are  used.  In  the  block-level  bypass,  both  the  matrix  element 
evaluation  and  the  node  solution  steps  are  bypassed  for  each  block  of 
inactive  connected  circuit  elements. 


The  indirect  nethod  is  snitsble  for  lsrge  NOS  digital  integrated 
circuits.  However,  for  nonlinear  analog  circnits  or  digital  circuits 
with  floating  capacitances,  the  solutions  are  not  guaranteed  to  con¬ 
verge,  or  if  they  do  converge,  the  convergence  rate  is  slow  [Id]  . 
Thus,  the  direct  method  is  more  suitable  for  analyzing  these  cir¬ 
cuits. 

The  types  of  circuits  that  are  analyzed  are  usually  very  large, 

with  thousands  of  nodes.  Although  sparse  matrix  techniques  have  been 

< 

found  to  be  computationally  efficient,  partitioning  the  system  matrix 
into  a  bordered-block  diagonal  form  has  been  suggested  as  one  way 
where  parallel  processing  could  be  used  to  speed  up  circuit  simula¬ 
tion  [5],  Another  approach  could  be  the  use  of  systolic  and  wave- 
front  array  processors  [2]  after  possibly  ordering  the  matrix  in 
band  form.  However,  the  application  of  systolic  arrays  and  wavefront 
array  processors  to  the  parallel  solution  of  bordered  block  diagonal 
partitioned  matrices  has  not  yet  been  investigated.  The  purpose  of 
this  thesis  is  to  propose  a  systolic  wavefront  array  architecture  for 
LU  factorization  of  partitioned  systems.  The  architecture  is  also 
compared  to  other  parallel  architectures  for  evaluation  purposes. 
Using  partitioning  techniques,  the  circuit  matrices  may  be  ordered 
into  bordered  block  diagonal  form  (BBDF)  [5].  Bordered  block  diago¬ 
nal  matrices  have  the  form  shown  in  Fig.  3.  The  circuit  can  be 
divided  into  subcircuits  which  are  connected  together  by  a  relatively 
small  number  of  nodes.  Some  heuristic  tearing  schemes  [14]  on  how 
the  reordering  can  be  done  have  been  published.  However,  to  obtain 
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maximum  concurrency,  a  nested  fora  of  the  clustering  algorithm  is 
proposed  in  this  thesis  to  reorder  the  nodes.  In  this  approach,  the 
entire  cirouit  is  first  divided  into  two  subcirouits  in  the  BBDF 
fora.  Each  of  the  tvo  subcicuits  is  further  divided  into  tvo  subcir¬ 
cuits  in  BBDF  in  a  recursive  manner.  This  procedure  continues  until 
each  subcircuit  is  approziaatel y  the  minimum  size  allowed.  LU  fac¬ 
torization  of  the  different  subcircuits  can  then  be  carried  out  in 
parallel.  However,  the  synchronization  of  the  solutions  at  the 
interconnection  level  is  a  problem  that  needs  careful  consideration. 

LU  factorization  involves  repetitive  computation  on  a  large  set 
of  data.  It  involves  a  aatriz  multi  plication- sub  tract  operation 
vhich  is  a  compute-bound  problem.  For  matrices  that  are  full,  a 
great  deal  of  concurrency  can  be  achieved  vith  cost-effectiveness  on 
single-instruction-multiple-data  (SIMD)  machines  because  of  the  regu¬ 
lar  structure  of  the  operations  performed  in  the  algorithm.  Hovever, 
as  the  matrix  becomes  sparse,  it  is  very  inefficient  to  use  the  same 
technique  and,  in  case  of  parallel  processing,  the  need  for  unstruc¬ 
tured  computations  is  too  complicated  for  cost  effective  solutions  on 
SIMD  machines.  This  is  because  the  LU  factorization  itself  involves 
a  great  deal  of  data  dependencies  vhich  lead  to  moderate  parallelism. 

The  alternative  architecture  is  multiple-instruction-multiple- 
data  stream  (HIMD) .  Hovever,  the  need  for  synchronization  of  data, 
the  amount  of  overheads  allotted  for  arbitration  be  tween  processors 
and/or  memory  conflicts  usually  degrade  the  performance.  Thus,  SIMD 
architecture  is  chosen  for  discussion. 


The  performance  evaluation  between  the  different  types  of  archi¬ 
tectures  for  LU  factorization  la  based  on  the  following  main  issues: 

1.  concurrency  achieved  by  the  machine 

2.  memory  access  and  I/O  time 

3.  synchronization  of  data 

4.  control  complexity 

5.  hardware  vs.  software  tradeoff  of  the  algorithm 

6.  performance-cost  effectiveness 

The  thesis  is  organized  as  follows.  In  Chapter  2,  the  steps 
involved  in  LU  factorization  are  clearly  defined  and  explained,  fol¬ 
lowed  by  a  study  of  general  purpose  and  special  purpose  architectures 
for  parallel  processing.  In  Chapter  3.  partitioned  systems  are 
solved  in  an  implementation  having  both  the  characteristics  of  the 

systolic  array  and  the  wavefront  method.  Chapter  4  shows  how  the 
entire  system  is  configured  using  the  direct  method  of  solving  the 
matrices,  whereas  Chapter  5  discusses  the  conf iguration  using  an 
indirect  Gauss-Seidel  method.  Chapter  6  is  the  conclusion  of  the 

thesis  and  suggests  future  direction  of  research.  The  block  diagram 

of  a  CMOS  processing  element  is  given  in  the  Appendix.  Such  a  pro¬ 

cessing  element  can  form  a  cell  in  a  systolic  array. 
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2 .1 .  Definition  of  LP  Factorization 

The  problem  that  this  paper  concentrates  on  is  the  LU  decomposi¬ 
tion  of  large  matrices  vhich  occurs  in  the  solution  of  linear  systems 
of  equations.  To  solve  the  problem  Az  *  y,  the  matrix  A  is  first 
decomposed  into  the  product  of  a  lover  triangular  (L)  and  an  upper 
triangular  (U)  matrix,  i.e..  A  m  LP.  Such  an  LP  decomposition  is 
unique  if  and  only  if  A  is  strongly  nonsingular  [10] .  Then  a  forward 
substitution  step  is  performed  where  z  is  determined  by  solving  Lz  * 
y.  The  solution,  x  .  can  then  be  obtained  by  solving  Px  ■  z.  This 
is  referred  to  as  the  direct  method  for  solving  linear  systems  of 
equations. 

Croat* s  algorithm  is  one  of  the  methods  used  in  the  decomposi¬ 
tion.  It  is  written  in  Pascal  as  follows: 

For  i:»  1  to  SIZE  do  begin 
for  j  :■  i+1  to  SIZE  do 

a[i, j]  :=  a[i, j]  /  a[i, i]; 
for  k:*  i+1  to  SIZE  do 
for  w:-  i+1  to  SIZE  do 

a  [k,  w]  :»  a[k,  w]  -  a[k,  i]  •  a[i,w]; 

end; 

where  it  is  assumed  that  at  every  step.  s[i.i]  is  not  0. 


The  efficiency  of  evaluating  the  inner  loop  in  the  above  algo¬ 
rithm  can  be  achieved  through  parallel  processing  of  the  machine  and 
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through  the  choice  of  e  good  algorithm.  The  multiplication  can  be 
speeded  up  by  special  purpose  hardware  whereas  the  addition  can  also 
be  speeded  up  by  faster  memories. 

This  algorithm  can  be  implemented  either  in  the  software  of  gen- 
eral  purpose  machines  or  in  the  hardware  of  special  purpose  machines. 
The  systolic  array  [4]  and  the  wavefront  array  [2]  processor  are  two 
different  hardware  implementation  of  the  algorithm.  The  basic 
difference  between  the  two  architectures  is  that  systolic  arrays  use 
a  synchronous  timing  scheme  whereas  the  wavefront  array  processors 
use  a  self-timed  handshaking  scheme.  Their  advantages  and  disadvan¬ 
tages  will  be  compared  in  the  following  sections. 

For  both  partitioned  and  non>partitioned  matrices,  there  are  two 
different  classes  of  methods  used  for  solving  linear  systems  of  equa¬ 
tions.  One  of  them  is  a  direct  method  in  which  the  results  are  com¬ 
puted  in  sequence  in  one  pass  through  the  hardware.  The  other  method 
is  an  indirect  method  such  as  a  relaxation-based  Gauss- Seidel  method. 
In  this  method,  an  initial  guess  is  made  for  the  values  of  the  inter¬ 
connection  nodes  in  the  first  pass  in  order  to  solve  for  the  values 
of  the  subcircuit  nodes.  The  new  values  of  the  interconnection  nodes 
are  then  compared  with  the  previous  guess.  The  whole  procedure  is 
repeated  until  the  computed  values  converge.  Latency  [5]  can  be 
explored  in  this  method  so  that  only  certain  subcircuits  need  be 
updated  after  each  iteration.  If  the  solutions  of  these  subcircuits 
are  not  within  a  specified  tolerance,  more  iterations  sre  needed. 
This  procedure  should  reduce  the  complexity  of  evaluating  partitioned 
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systems  as  will  be  seen  in  tbe  next  chapters;  but  is  recommended  to 
be  used  when  the  iteration  process  is  expected  to  converge  reasonably 
fast. 

These  two  methods  are  discnssed  in  the  following  chapters  after 
a  hardware  scheme  is  chosen  for  the  LU  factorization  array. 

2.2.  GENERAL  PURPOSE  vs.  SPECIAL  PURPOSE  ARCHITECTURE 

2.2.1.  General  Pnrnose  Architectnre 

The  main  advantage  of<  general  purpose  machines  is  the  flexibil¬ 
ity  they  offer.  Different  operations,  algorithms,  sizes  of  matrices 
and  sparsities  of  matrices  can  be  evaluated  by  changing  the  software 
program.  If  a  certain  algorithm  is  chosen,  the  LU  factors  of  dif¬ 
ferent  matrices  can  be  computed  with  only  very  slight  modifications, 
e.g.  ,  changing  certain  parameters  or  adding  several  subroutines  to 
handle  sparsity.  However,  performance  is  sacrificed  for  this  flexi¬ 
bility.  A  great  deal  of  buffering,  I/O  and  memory  access  and  execu¬ 
tion  overheads  result.  A  more  complicated  instruction  set  must  be 
implemented  in  general  purpose  machines  for  various  applications. 
Additional  time  is  needed  to  decode  these  instructions  and  extra 
hardware  and  software  must  be  incorporated  as  well  to  implement  these 
instructions.  There  is  also  the  data-routing  problem  to  get  the 
sequential  data  into  and  out  of  parallel  arrays.  This  operation 
requires  complex  address  calculations  which  are  expensive  on  general 


purpose  machines 


Another  disadvantage  is  that  with  the  fixed  word  length  in  the 
machine,  rounding  errors  in  algebraic  processes,  if  not  properly  con¬ 
trolled,  nay  lead  to  unreliable  solutions.  Special  purpose  Machines 
can  be  designed  with  very  good  numerical  control  for  these  scientific 
coaputations.  This  can  be  achieved  with  special  hardware  for  pivot¬ 
ing  [6]  so  that  the  denoainators  of  the  divisions  daring  LU  factori¬ 
zation  are  always  the  largest  numbers  along  the  columns.  This  is 
usually  done  in  the  software  of  general  purpose  machine  and  is  a 
time-consuming  process. 

The  fact  that  fast  general  purpose  machines  are  complicated  to 
design  and  are  usually  very  expensive  calls  for  the  design  of  special 
purpose  machines  that  can  be  connected  to  an  existing  and  relatively 
inexpensive  host  machine  for  fast  numerical  computation. 

2.2.2.  Special  Purpose  Architecture 

The  hardware  cost  and  size  of  both  special  and  general  purpose 
machines  are  relatively  insignificant  compared  to  the  software  cost. 
However,  the  design  cost  of  special  purpose  processors  is  usually 
much  less  than  general  purpose  processors.  A  special  purpose  machine 
usually  consists  of  simple  processors  of  the  same  hind  connected  by  a 
network  of  local  and  regular  interconnects.  Extensive  concurrency 
can  be  achieved  if  the  algorithm  is  designed  to  introduce  high 
degrees  of  pipelining  and  multiprocessing.  Data  can  be  routed  to  the 
appropriate  processor  directly  from  memory  to  minimize  the  memory 
access  time,  which  is  the  bottleneck  of  most  algorithms.  Thus,  mul¬ 


tiple  coaputations  can  be  performed  per  I/O  access. 
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The  bottleneck  of  using  special  purpose  mesh  connected  proces¬ 
sors  is  the  time  wasted  in  interprocessor  data  movement  between  the 
special  purpose  processors  and  the  host.  However,  the  computation 
can  usually  be  done  faster  than  general  purpose  machines.  Less  over¬ 
heads  are  involved  in  decoding  instructions  and  buffering  because  the 
architecture  is  geared  only  towards  the  computation  needed. 

One  disadvantage  of  special  purpose  machines  is  their  inflexi¬ 
bility.  If  a  different  faster  algorithm  for  LU  factorization  is 
needed,  the  whole  machine  has  to  be  redesigned  and  rewired.  Also,  in 
special  cases,  for  example,  sparse  matrices  which  cannot  be  ordered 
in  the  block  diagonal  form  may  require  complicated  hardware  circuitry 
in  the  special  purpose  machine  as  opposed  to  changing  the  software 
programs  in  the  general  purpose  machine.  Thus,  there  is  a  hardware 
versus  software  tradeoff  between  the  two  implementations. 

Because  of  the  inflexibility  of  special  purpose  processors,  a 
well  designed  algorithm  is  an  important  starting  point.  However,  a 
good  algorithm  for  VLSI  implementation  may  not  necessarily  be  one 
requiring  minimal  computation.  VLSI  implementation  is  preferred  for 
special  purpose  processors  to  reduce  difficulties  in  reliability, 
performance  and  heat  dissipation  that  arise  from  many  SSI  and  MSI 
components.  VLSI  technology  enables  faster  communication  and  more 
accurate  system  clocking  due  to  the  effect  of  scaling  all  components 
to  several  transistors  and  several  metal  and  poly  lines.  VLSI  also 
enables  more  modularity  and  progr ammabil ity  because  of  the  cost- 
effectiveness  of  the  design. 


In  inniry,  the  criteria  for  special  purpose  Machine s  are: 

1.  The  design  must  be  implemented  by  only  a  few  different  types  of 


aiaple  cells  to  cut  down  on  design  cost. 

2.  Its  architecture  aust  be  based  on  a  siaple  and  regular  data  and 
control  flow  that  can  be  connected  by  a  network  with  local  and  reg¬ 
ular  interconnections. 

3.  It  aust  fully  utilize  pipelining  and  aul ti processing.  Several 
data  streaas  can  move  at  a  constant  velocity  over  fixed  paths  in 
the  network,  interacting  at  cells  where  they  aeet. 

4.  A  large  number  of  cells  are  active  at  one  time  so  that  computa¬ 
tion  speed  can  keep  up  with  data  rate. 

5.  The  siaple  cells  are  identical  and  connected  in  a  regular 
fashion  to  increase  modularity  and  extensibility  for  different 
aatrice  s. 

Considering  the  above  advantages  and  disadvantages,  two  dif¬ 
ferent  multiprocessor  arrays,  the  systolic  array  and  the  wavefront 
array  processors,  are  compared  for  their  effectiveness  in  parallel 
proce  ssing. 

2  *2  •  Special  Purpose  Architecture 

i  .3  .1 .  Systolic  Array  Processors 

The  systolic  array  is  chosen  as  the  special  purpose  machine  for 
discussion  because  it  satisfies  all  the  criteria  listed  for 
performance /cost  effective  special  purpose  machines.  Also,  systolic 
arrays  facilitate  VLSI  implementation  by  encouraging  modular  design 
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and  local  communication  whereas  many  other  special  purpose  machines 
that  have  been  proposed  [7,10]  usually  require  LSI/MSI  parts  with 
massive  controls  and  complicated  hardware. 

The  term  "systole"  means  a  rhythmically  recurrent  contraction, 
especially  the  contraction  of  the  heart  by  which  the  blood  is  forced 
onward  and  circulation  kept  up.  This  is  analogous  to  the  systolic 
array  idea. 

A  systolic  array  processor  consists  of  an  array  of  simple  pro¬ 
cessors  connected  together  locally  with  inputs  and  outputs  connected 
to  a  bus  which  is  controlled  by  a  host  computer.  The  systolic  system 
rhythmically  computes  and  passes  data  through  the  system.  All  the 
inputs  in  the  systolic  cell  move  to  the  adjacent  processor  of  the 
cell  during  each  beat.  The  computation  is  then  done  rhythmically 
(i.e.,  depending  on  the  beat)  and  the  result  is  shifted  as  input  to 
the  next  cell  in  the  next  beat.  This  is  the  basic  data  flow  of  the 
architecture. 

According  to  S.Y.Kung  [13],  a  systolic  array  must  have  the  fol¬ 
lowing  properties: 

1.  Synchrony:  The  data  transfer  and  computations  must  be  controlled 
by  a  global  clock. 

2.  Regularity:  The  array  must  consist  of  modular  processing  elements 
with  regular  and  spatially  local  interconnections.  Moreover,  the 
network  must  be  extendable. 

3.  Temporal  Locality:  At  least  one  unit  time  delay  must  be  allotted 
so  that  signal  transactions  from  one  node  to  the  next  can  be 


couple  ted. 

4.  Linear-rate  Pi pel  inability:  The  array  must  achieve  an  O(n) 
speedup,  where  n  it  the  number  of  processing  elements. 

S. T.  Sung  [13]  has  also  devised  a  systematic  way  of  syatolizing 
an  algorithm.  A  signal  flow  graph  (SPG)  is  constructed  to  represent 
the  input  and  output  relationships  of  an  algorithm.  Then  the  basic 
operation  module  is  selected.  After  that,  two  cut-set  (temporal) 
localization  procedures  are  applied: 

Rule  (i):  Time-Scaling:  all  delays  may  be  scaled  by  a  single  positive 
integer. 

Rule  (ii):  Del  ay- Transfer:  advance  k  time  units  on  all  the  outbound 
edges  and  delay  k  time  units  on  the  inbound  edges,  and  vice  versa. 

In  the  final  step,  combine  the  delay  of  the  operation  modules  with 

the  module  operation  to  form  a  basic  systolic  element.  All  the  extra 

delays  will  be  modeled  as  pure  delays  without  operations. 

The  signal  flow  graph  for  LU  decomposition  is  shown  in  Fig.  4. 
Its  corresponding  systolic  array  is  shown  in  Fig.  3.  The  array  is 
organized  like  a  honeycomb  in  a  rhombus  shape.  The  following  func¬ 
tions  are  performed  by  each  cell: 

Aout  *  Ain  -  Lin  •  Din 

Dout  ”  Din  or  Ain  (at  the  boundary) 

Lout  3  Lin  or  Ain/Din  (at  the  boundary) 

The  nmnber  of  cells  needed  in  this  topology  is  equal  to  the  number  of 
elements  in  a  full  matrix,  e. g. ,  for  a  full  4x4  matrix,  16  cells 
are  needed.  For  a  banded  matrix,  the  number  of  cells  needed  is  the 
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width  of  the  first  row  tiaes  the  height  of  the  first  col nan  of  the 
astriz.  Only  1/3  of  the  cells  ere  performing  computation  at  each 
clock  cycle  while  all  of  them  are  actively  shifting  data  of  soae 
sort. 

Because  it  is  difficult  to  control  the  pipeline  of  systolic 
arrays,  Kai  Hwang  [6]  has  proposed  a  rectangular  iapleaentation  that 
uses  n  cells  less  for  an  n  z  n  aatriz  but  coaputes  the  LU  factors 
faster  than  the  above  aethod.  The  speed  is  accoaplished  by  utiliz¬ 
ing  aore  cells  for  coaputation  st  any  given  tiae.  This  aethod  is 
different  froa  the  systolic  array  because  interface  latches  are  used 
to  store  interaediate  results  instead  of  using  registers  in  cells. 
As  a  result,  synchronization  of  data  is  very  crucial  in  systolic 
arrays  whereas  data  can  be  stored  in  the  latches  in  the  Hwang  aethod. 
The  interconnections  and  inputs  of  this  topology  are  as  shown  in  Fig. 
6.  The  nuaber  of  cells  needed  by  the  4  z  4  full  aatriz  is  12,  which 
is  less  than  the  regular  systolic  array. 


2 .3  .1 .1 .  Advantages  of  Systolic  Arrays 


The  systolic  cell  maziaizes  the  use  of  input  data  fetched  froa 
memory  with  modest  I/O  bandwidths  for  outside  communication  with  the 
host.  This  is  done  by  inputting  data  at  the  appropriate  cell  at  a 
regular  rate  and  then  shifting  the  data  to  the  appropriate  processor 
for  coaputation.  Thus,  no  aultiple  memory  access  of  the  saae  data  is 
needed  and  no  additional  complez  address  calculations  are  required  to 
retrieve  data.  When  the  memory  speed  is  larger  than  the  cell  speed, 
two-diaensional  systolic  arrays  are  used  because  at  each  cell  cycle. 
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Pig.  6.  VLSI  Pipeline  for  LU  decomposition  using  the  Hwang 
Implementation  [6] . 
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all  the  I/O  porta  on  array  boundaries  can  be  input  or  output  data  to 
or  from  memory.  Memory  bandwidth  can  be  fully  utilized  and  substan¬ 
tially  reduces  processor-memory  traffic  (memory  access  bottleneck)  in 
comparison  to  that  required  in  other  architectures. 

The  algorithm  on  which  the  hardware  of  the  systolic  array  is 
based  fully  exploits  parallel-pipelined  processing  and  speeds  up 
compute-bound  computations.  Multiprocessing  many  results  in  parallel 
is  achieved  by  converting  the  output  (Aout)  as  input  (Ain)  to  the 
next  cell,  and  computation  of  each  single  result  within  the  cell  is 
also  pipelined.  Thus  the  systolic  cell  is  a  two-stage  pipeline. 
System  cycle  time  is  the  time  of  a  stage  of  a  cell  and  not  the  whole 

array  of  cells  at  cycle  time.  Very  fast  computation,  up  to  200  Mflops 

(million  floating  point  operations  per  second)  [3]  ,  can  be  achieved. 

The  cells  are  also  simple  and  identical.  The  cell  can  basically 

consist  of  shift  registers  or  latches,  a  multiplier,  a  reciprocator 
and  an  accumulator.  A  local  memory  may  also  be  included  to  store 
intermediate  operands  and  results.  Data  and  control  flows  are  simple 
and  regular  as  shown  in  the  Kai  Hwang  [6]  systolic  array. 

The  design  of  the  systolic  cell  is  completely  modular  and 
expandable  with  no  difficult  synchronization.  Additional  cells  can 
be  connected  together  in  the  same  topology  for  larger  matrices.  In 
fact,  as  the  number  of  cells  expands,  system  cost  and  performance 
increase  proportionally  if  the  size  of  the  underlying  problem  is  suf¬ 
ficiently  large. 


One  other  advantage  of  apeeial  purpose  na chines  over  general 
purpose  machines  is  that  the  software  overhead  associated  with  opera¬ 
tions,  snch  as  indexing,  are  totally  eliminated. 


2 .3 .1 .2  .  Disadvantages  of  Systolic  Arrays 

Global  synchronization,  however,  could  be  a  disadvantage  as  the 
size  of  the  array  increases.  Yhile  an  asynchronous  model  incurs  a 
fixed  time  delay  overhead  due  to  the  handahaking  process,  the  syn¬ 
chronous  time  delay  is  primarily  due  to  the  clock  skew  which 
increases  with  the  size  of  the  array.  Even  in  an  B-tree  clock  dis¬ 
tribution  [8]  which  maintains  the  same  on-chip  clock  line  length  to 
each  of  the  N  x  N  modules,  the  clock  skew  grows  at  a  rate  of  OCN3). 
In  addition,  the  systolic  arrays  are  rather  inflexible  to  changes  in 
the  algorithm,  for  example,  the  partitioning  of  sparse  matrices  into 
block  diagonal  form  for  higher  concurrency  may  lead  to  interconnec¬ 
tion  hardware  that  is  extremely  complex  to  design,  whereas  solving 
the  huge  sparse  matrices  using  unpartitioned  matrix  systolic  opera¬ 
tion  is  a  waste  of  processor  and  waste  of  computation  time  as  most 
elements  are  zero  elements  unless  the  matrix  is  banded.  Even  for 
banded  matrices,  there  is  a  limit  of  4u  on  execution  time  [13]. 
Interconnection  of  the  submatrices  of  sparse  matrices  ,  if  effi¬ 
ciently  solved  in  hardware,  may  lead  to  special  purpose  machines  with 
very  high  throughput.  Several  schemes  that  are  not  systolic  in 
nature  have  been  proposed.  They  are  the  LU  unstructured  sparse 
matrix  machine  by  Pottle  [7],  partitioned  matrix  by  Ewang  and  Cheng 
[10]  and  the  Blossom  System  by  Sangi ovsnni- Vince ntel 1 i  [1]. 


2 •! 4 *1  •  Timing  Analysis  and  Number  of  Processors 
Poor  schemes  need  for  LU  factorization  are  considered: 

1.  Regular  LU  banded  matrix 

Decomposition:  T  ■  3n  +  min(r.s);  N  -  rs 
Triangularization:  T  *  3n;  N  ■  (n  +  n)/2 

2.  LU  unstructured  sparse  matrix  by  Pottle 

T  ■  nl;  N  >■  dmax 

3.  Partitioned  matrix  algorithm  by  Hwang  and  Cheng 

T  -  0(n)  or  0(n/m);  N»  0(n)  or  O(Nm) 

4.  Blossom  by  Sangiovanni-Vincentel i 

T  «  0(n/x)  for  full  matrices,  else  O(Nwh);  N  -  x 
where, 

T  is  the  time  complexity 

N  is  the  number  of  processors 

n  is  the  maximum  size  of  the  matrix 

r+s-1  is  the  bandwidth  of  a  banded  matrix 

r  is  the  width  of  the  first  row  of  the  banded  matrix 

s  is  the  length  of  the  first  column  of  the  banded  matrix 

1  is  the  mean  degree  of  columns  of  a  sparse  matrix 

m  is  submatrix  size 

w  is  the  original  border  width 

h  is  a  function  of  x  and  m 

dmax  is  the  maximum  nmber  of  nonzeros  in  any  row  of  the  operand 

The  hardware  suggested  by  Pottle  [7]  is  difficult  to  design  and 
involves  a  great  deal  of  control  signals  and  local  storage.  More- 


over,  there  is  not  enough  information  on  the  processor  array  used  in 
the  Blossom  system  [1].  However,  the  systolic  array  and  the  Hwang 
[6]  scheme  are  easy  to  control  and  to  design.  The  systolic  array  is 
more  suitable  for  banded  matrices  while  the  Hwang  [6]  scheme  is 
better  suited  for  full  matrices.  However,  in  our  applications,  the 
matrix  is  assumed  to  be  partitioned  into  submatrices  so  that  each 
submatrix  is  nearly  a  full  matrix. 


.  Wavefront  Array  Processor  Architecture 


The  wavefront  array  processor  (WAP)  is  configured  in  a  square 
array  of  NxN  processing  elements  with  regular  and  local  interconnec¬ 
tions  (Fig. 7)  .  The  computing  network  serves  as  a  (data)  wave- 
propagating  medium.  The  computational  sequence  starts  with  one  ele¬ 
ment  and  propagates  through  the  processor  array,  closely  resembling  a 
physical  wave  phenomenon.  A  second  wavefront  can  then  be  pipelined 
immediately  after  the  first  one  propagates.  Also,  wavefronts  of  two 
successive  recursions  of  the  software  loop  will  never  intersect 
(Huygen' s  Wavefront  Principle)  [13]  because  the  processors  executing 
the  recursions  at  any  given  instant  will  be  different,  thus  avoiding 
any  contention  problems. 

According  to  S.7.  Sung  [13]  ,  a  wavefront  array  is  a  computing 
network  which  possesses  the  following  properties: 

1.  Self-Timed,  Data-Driven  Computation:  The  network  is  data-driven, 
i.e.  ,  the  computation  is  fired  as  soon  as  the  data  arrives;  thus,  no 
global  clock  is  needed. 


2.  Nodularity  and  Local  Interconnection:  This  property  is  the  same 
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Fig.  7.  The  WAP  conf  iguration  [2], 


as  the  systolic  array,  except  that  the  wavefront  array  can  be 
extended  indefinitely  without  global  synchronization  probleas. 

3.  0(n)  Speedup  and  Pipel inability:  similar  to  the  systolic  array. 

Thus,  the  only  major  difference  between  the  wavefront  and  sys¬ 
tolic  array  is  the  data-driven  property.  Therefore,  temporal  local¬ 
ity  is  no  longer  needed  in  the  wavefront  array  and  thus  the  wavefront 
array  is  faster  and  easier  to  program. 

To  perform  LU  factorization,  each  recursion  consists  of  a  main 
wavefront  and  an  alternate  wavefront.  The  main  wavefront  computes  Li 
at  the  (*,i)  processors  and  sends  it  left  to  form  the  columns  of  L  in 
the  left  memory  modules.  It  also  computes  Ui  at  t>'  (1,*)  processors 
and  sends  it  up  to  form  the  rows  of  U.  The  means  all  the  ele¬ 
ments  in  a  particular  row  or  column  indicated.  In  the  interior  pro¬ 
cessors,  Ai  *  A(i-l)  -  Li  •  Di  is  computed.  The  main  wavefront  must 
also  send  the  new  Ai '  s  up.  The  alternate  wavefront  sends  Aij  down 
and  left.  After  the  first  pair  of  wavefronts  is  initiated,  the 
second  pair  can  be  pipelined  immediately. 

1 4 4  *i  •  Advantages  and  Disadvantages  of  Wavefront  Array  Proces¬ 
sors 

The  WAP  has  the  advantage  of  the  systolic  array  in  the  fact  that 
it  is  very  modular  and  with  local  interconnection.  However,  a  wave- 
front  architecture  can  also  provide  asynchronous  waiting  capability 
and  consequently  can  cope  with  timing  uncertainties  such  as  local 
clocking,  random  delay  in  communications  snd  fluctuations  of  comput- 
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ing  times,  Thus,  the  notion  lends  itself  to  a  (a synchronous)  data¬ 
flow  computing  structure  that  conforms  well  with  the  constraints  of 
VLSI. 

The  wavefront  notion  drastically  reduces  the  coaplexity  in  the 
description  of  parallel  algorithms.  The  mechanism  provided  for  this 
description  is  a  special-purpose,  wavefront  oriented  language. 
Rather  than  requiring  a  program  for  each  processor  in  the  array,  this 
language  allows  the  programmer  to  address  an  entire  front  of  proces¬ 
sors.  This  solves  the  problem  of  programmability  and  extensibility 
of  systolic  arrays.  . 

The  TAP  shares  a  hey  concept  with  data-flow  machines:  the 
arrival  of  data  fires  each  processing  element  (PE),  which  subse¬ 
quently  sends  relevant  data  to  the  next  Hi.  The  WAP  can  be  regarded 
as  a  homogeneous  data-flow  processing  element.  The  ’VATI"  for  data 
feature,  provided  by  handshaking,  allows  for  the  globally  asynchro¬ 
nous  operation  of  processors,  i.e.,  there  is  no  need  for  global  syn¬ 
chronization.  Scheduling  and  synchronization  are  built  at  the 
hardware  level.  These  qualities  make  the  WAP  very  appealing  for  VLSI 
implementation  as  well.  Figs. 8  and  9  show  the  difference  between 
synchronous  and  asynchronous  handshaking.  Fig.  10  shows  the  delay 
modules  for  two  adjacent  asynchronous  modules.  The  loop  delay,  dA, 
is  the  time  delay  between  servicing  successive  data  bits  for  the 
asynchronous  architecture.  It  is  given  by  [8]: 

dA  -  2dL  +  dF  +  dP 

where  dL  *  propagation  delay  of  the  combinational  logic 


Crossbar  interconnection  network  using  asynchronous 
modules  [8] .  BO  is  used  for  logic  0  data  and  R1  for 
logic  1  data.  A  is  the  acknowledge  signal. 
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dF  =  propagation  delay  of  tlie  feedback  path 
dP  ■  propagation  delay  along  the  request  path  from 
module  i  to  module  j 

However,  the  synchronous  delay  module  as  shown  in  Fig.  11  is  given 
by: 

dS  -  dL  +  2dM  +  dP  +  dC 

where  dM  *  propagation  delay  of  the  memory  elements 
and  dC  “  dci  -  dcj  *  clock  skew 

Thus  for  large  dP,  the  synchronous  system  data  rate  is  higher  than 
that  of  the  asynchronous  system  whereas  as  the  clock  skew  is 
increased,  the  asynchronous  system  is  better. 

In  conclusion,  the  VAP  is  an  optimal  tradeoff  between  globally 
synchronized  and  dedicated  systolic  array  and  general  purpose  data¬ 
flow  multiprocessor.  The  Hwang  (rectangular)  configuration  of  the 
systolic  array  is  analogous  to  the  WAP  .  The  difference  is  that 
input  data  is  synchronously  fed  into  the  array  in  the  Hwang  implemen¬ 
tation  whereas  in  the  WAP,  the  network  is  data-driven.  However,  the 
asynchronous  handshaking  and  the  wavefront  language  based  programming 
control  can  be  implemented  for  this  rectangular  systolic  array. 
Thus,  the  rectangular  configuration  is  chosen  for  analysis  of  the  LU 
factorization  of  both  normal  bordered  block  diagonal  form  (BBDF)  and 
nested  BBDF  matrices.  This  thesis  describes  the  design  of  a  parallel 
processor  array  for  LD  factorization  by  modifying  and  combining  the 
traditional  systolic  array,  the  Hwang  implementation  and  the  wave- 
front  array  processor.  This  modified  design  has  the  characteristic 
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of  the  systolic  array  because  within  each  level  of  the  processor 

array,  the  data  are  fed  in  a  synchronous  manner,  each  beat  being  con-* 

trolled  by  a  system  clock.  The  Hwang  implementation  of  the  interface 
latches  are  used  because  more  processing  elements  are  then  utilised 
for  computation.  In  addition,  the  asynchronous  handshaking  scheme  is 
very  useful  for  taking  care  of  the  irregularity  of  the  outputs  from 
the  previous  processing  level.  The  asynchronous  handshaking  only 
needs  to  take  place  at  the  cells  located  at  the  edge  of  the  processor 
array  where  the  input  data  are  fed.  The  software  program  can  gen¬ 
erate  an  input  data  valid  flag  to  start  the  wavefront  of  input  data 

at  each  clock  pulse.  The  Appendix  gives  the  VLSI  design  of  such  a 
semi- synchronous  processing  element.  It  is  semi- synchronous  because 
it  has  both  a  system  clock  and  asynchronous  handshaking  signals.  The 
asynchronous  handshaking  is  basically  needed  as  the  outputs  of  a  cer¬ 
tain  level  of  processor  array  flows  as  inputs  to  the  next  level. 
This  is  because  the  processing  elements  may  not  produce  the  outputs 
in  the  pattern  needed  as  inputs  to  the  subsequent  levels. 


AfcCHlTE&UKB  FOK  LU  FACFOUZATION  OF  PARTITIONED  SYSTEMS 


1.1.  Characteristics  of  the  Matrices 


Moat  of  the  natricea  in  circuit  simulation  programs  are  large 
and  sparse.  Each  node  in  the  circuit  is  connected  to  a  small  subset 
of  nodes.  Thus,  the  circuits  can  usually  be  ordered  into  Bordered- 
Bloch  Diagonal  Form  (BBDF)  [5] . 
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Parallel  processing  can  be  utilized  by  finding  the  LU  factors  of 
all  the  subcircuits  at  the  same  time  and  then  the  results  are 
automatically  merged  properly  together  to  form  the  interconnection 
level.  The  following  matrices  are  then  factored  in  parallel: 
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As  discussed  in  the  previous  section,  a  systolic  approach  can  be 
used  in  the  architecture.  Two  implementations  of  the  mesh  connec¬ 
tion  are  studied:  the  Hexagonal  systolic  array  by  Kung[4]  and  the 


rectangular  network  by  Swang  [6].  It  is  found,  using  hand  simula¬ 


tions,  that  the  Hwang  [6]  scheme  takes  advantage  of  maximum  processor 


utilization  as  well  as  taking  fewer  clock  cycles  for  processing  the 


matrix.  The  hexagonal  systolic  array  can  also  be  described  as  a  rec¬ 


tangular  array.  However,  the  difference  between  the  Hwang  [6]  method 


and  the  regular  systolic  array  is  in  the  use  of  interface  latches  in 


the  Hwang  [6]  method  so  that  the  timing  is  different  between  the  two. 


A  modified  scheme  of  these  two  methods  is  studied  in  the  next  sec¬ 


tion. 


The  partitioned  matrix  is  better  suited  for  parallel  processing 


than  the  banded  matrix.  It  can  be  seen  from  a  comparison  of  the  time 


it  takes  a  regular  systolic  array  to  compute  the  LU  factors  of  a 


banded  matrix  and  a  BBOF  matrix  (Fig.  12) .  As  p  and  q  gets  <<  n  and 


as  m  gets  larger,  t(partitioned)  <  t(banded).  Using  the  systolic 


array  by  H.  T.  Kung  [4], 


t(band)  =  3n  +  min  (p,q)  ~  3n  +  p 


Assume  q  is  either  larger  or  equal  to  p. 


t(partitioned)  <  t(sub)  +  t(int) 


t(sub)  *  4[p+n-mp] 


t(int)  <=  4(n-mp) 


Asstnte  summation  level  is  done  almost  simultaneously.  Also,  assume  a 


full  matrix  at  the  interconnection,  whereas  most  of  the  time,  the 


interconnection  can  be  ordered  as  a  band  matrix. 


t(part)  ■  4[p+n-mp+n-mp]  *  4[p+2n-2mp] 


One  of  the  problems  is  to  find  the  optimal  number  (m)  of  subcircuits 
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Fig.  12.  Coaparison  of  the  time  taken  to  perform  LU  factorization 
on  a  banded  matrix  and  a  BBDF  matrix  using  systolic 
arrays. 


to  divide  the  entire  ci renit. 


For  t(bend)  >*  t(partitioned) 

3n+p  >■  8n-8mp+4p 
(8n-3)p  >-  5n 

Thus  for  the  partitioned  matrix  to  be  faster  than  the  band  matrix, 
make  p  and  m  large. 

If  the  interconnection  snbcircnit  is  a  band  matrix,  each  border  node 
depends  only  on  w  nodes  (Fig.  13). 
t(snb)  *  4[p+w] 
t(int)  “  3[n-mp]+r 

t(part)  <  t(snb)  +  t(int)  *  4(p+w)  +  3(n-mp)  +  r 
.  t(band)  ■  3n+p 

For  t(band)  >*  t(part) 

3n+p  >-  4(p+w)  +  3(n-mp)  +  r 
3p(m-l)  >»  4w  +  r 
For  example,  given  r*l ,  p*2 ,  v“l 

m  >*-5/6+1  e.g.  a  *  2  ,3  ,4 . 

However,  the  problem  with  the  normal  BBDF  form  of  matrices  is 
that  the  interconnection  matrix  and  the  border  can  get  very  large  and 
sparse.  Then,  the  control  of  data  from  the  snbcircnits  to  the  inter¬ 
connection  level  can  be  very  complicated.  Thns,  an  algorithm  for 
partitioning  the  matrix  is  investigated.  The  algorithm  uses  a  clus¬ 
tering  procedure  to  obtain  a  'nested'  partitioned  matrix  form.  The 
reason  that  the  clustering  algorithm  can  be  useful  is  that  most  cir¬ 
cuits  considered  are  only  connected  to  a  few  other  nodes.  So  one  can 
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Fig.  13.  A  BBDF  matrix  with  sparse  interconections. 
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always  divide  the  whole  circuit  into  snbcircnits  which  are  connected 
only  through  a  relatively  small  number  of  interconnection  nodes. 

The  matrix  is  divided  into  BBDF  in  a  nested  manner,  starting 

ty  J 

with  2  submatrices,  then  matrices  up  to  2  matrices.  Each  subma¬ 
trix  is  also  in  BBDF.  This  method  is  used  to  simplify  the  intercon¬ 
nection  level  logic.  Tith  the  original  BBDF,  the  border  can  get 
very  large  but  sparse.  To  utilize  the  sparsity  maximally  requires 
very  complicated  and  sometimes  irregular  timing  controls.  However, 
using  a  nested  algorithm,  even  though  more  mesh  layers  may  be  needed, 
the  synchronization  of  data  is  very  simple. 

3 .2 .  The  Nested  Cluaterlna  Algorithm 

The  problem  with  any  heuristic  clustering  algorithm  is  the 
uneven  size  of  the  subcircuits.  Tiis  can  be  remedied  by  patching  the 
circuits  with  1 ' s  on  the  diagonal  and  zeros  everywhere  on  the  extra 
rows  and  columns  (Fig.  14)  .  The  tearing  nodes  chosen  may  not  be  the 
minimum  and  may  not  give  the  most  identical  number  of  clusters  since 
a  certain  small  number,  «  ,  has  to  be  chosen  around  the  vicinity  of 
nmax  allowed  for  each  level  of  nesting.  The  clustering  algorithm  is 
performed  by  the  following  procedure,  and  is  a  modification  of  the 
algorithm  given  in  [14]: 

1.  Choose  initial  iterating  node  [IS(1)]  with  minimum  degree. 

2.  Store  in  adjacency  set  [AS(1)]  all  nodes  that  are  adjacent  to  the 
node  in  IS(1). 

3.  PI  ice  the  cardinality  of  AS(1)  in  (N(l). 
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4.  Let  i-1. 

5.  If  Qi(i)"0,  check  if  any  node  is  not  covered  yet.  If  all  the  nodes 
are  considered,  stop;  otherwise,  continue. 

6.  Choose  next  iterating  node.  n(i+l),  froai  AS(i)  and  place  it  in 
IS(i+l)  by  a  greedy  strategy  with  ainiam  CN(  i+1)  . 

7.  Update  AS(i+l)  from  AS(i)  by  deleting  the  node  n(i+l)  and  adding 
all  new  nodes  thst  are  adjacent  to  n(i+l)  that  are  not  already  in 
AS( i)  or  in  the  union  of  the  sets  IS(j)  where  j  equals  1  to  i. 

8.  Pat  CN( i+1)  -  | AS( i+1) I  . 

9.  let  i°*  i+1. 

10.  Go  to  step  5  until  \rUZ\  nodes  have  been  selected.  Then  choose 
the  node,  k,  between  max  +  s  that  gives  the  ainiaua  CN.  sis  the 
saallest  integer  that  can  give  the  ainiaua  (N.  All  nodes  1  ->  k  fora 
a  cluster  with  all  the  nodes  in  the  adjacency  set  as  tearing  nodes. 

11.  Now  for  the  first  cluster,  divide  the  cluster  in  half  again  as 
in  step  10.  Repeat  step  11  until  the  clusters  are  at  a  ainiaua  size. 

12.  Also,  delete  the  first  cluster  and  its  tearing  nodes  and  con¬ 
tinue  steps  (6)  to  (10)  for  the  second  cluster.  Repeat  step  (11) 
until  the  clusters  are  at  a  ainiaua  size. 

Refer  to  Figs.  15a,  15b,  and  16  for  an  illustration  of  the  aigo- 
ritha.  The  clustering  slgoritha  is  a  software  solution  to  organizing 
the  interconnection  nodes  in  groups  before  processing  through  the 
processor  array.  Using  the  nested  BBDF,  aore  parallel  processing  can 
be  done  with  siapler  interconnection,  as  is  shown  next. 


Fig.  15a.  The  nested  clustering  algorithm  splits  the  entire  circui 
in  half  in  the  first  step. 


Fig.  15b.  Then  each  individual  cluster  is  furthur  split  in  half 
during  the  next  recursion. 
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i .1 .  Design  of  the  Modified  Systolic  Array 

The  hexagonal  systolic  array  is  orgranized  in  a  honeycomb  shaped 
mesh.  Its  topology  can  actually  be  implemented  in  a  rectangle.  The 
data  flow  of  snch  a  rectangnlar  systolic  array  will  still  be  the  same 
as  the  hexagonal  topology.  The  four  diagrams  shown  in  Fig. 17  [13] 
illustrate  the  data  flow  of  such  a  systolic  array.  Input  data  is  fed 
along  the  diagonals  of  the  matrix.  The  L  factors  are  then  taken  from 
the  left-hand  edge  of  the  network  while  the  U  factors  are  taken  from 
the  right-hand  edge  of  the  network.  For  the  given  30  x  30  nested 
BBDF  matrix,  the  input  timings  are  shown  in  Fig. 18.  The  numbers 
indicate  the  time  that  the  input  data  at  the  corresponding  position 
are  fed,  e.g. ,  element  (1.1)  is  fed  at  time  t^ .  The  output  timings 
are  shown  in  Fig.  19.  The  numbers  indicate  the  time  that  the  output 
comes  out  of  the  processor  array. 

The  inputs  are  fed  along  the  diagonal  of  the  matrix  with  the 
diagonal  element  first,  followed  by  the  rest  of  the  elements  along 
the  diagonal.  Zeros  are  fed  in  at  the  location  of  the  interconnection 
level  at  each  of  the  subcircuits.  After  all  the  subcircuit  data- 
independent  elements  are  fed  into  the  array,  the  rest  of  the  zeros 
are  fed  in  with  all  the  diagonal  elements  at  the  same  time.  Note 
that  the  input  data  streams  are  separated  from  each  other  by  a  clock 
pulse.  At  the  interconnection  location,  the  partial  sums,  £  Lik  * 
Ukj  are  carried  out. 

To  insure  the  correct  timing,  an  asynchronous  handshaking  scheme 
is  incorporated  on  top  of  the  synchronous  data  sequencing.  No 
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computation,  i.e.,  multiplication  and  addition,  is  done  until  the 
input  data.  A,  L  and  U,  have  their  input  valid  flags  on.  This  dis¬ 
tinguishes  the  normally  zero  input  data  elementa  from  the  one  a  at  the 
interconnection  location.  Also,  the  partial  sums  £  L*U  are  taken  out 
at  different  points  on  the  network  following  a  certain  pattern.  They 
are  then  fed  into  modules  which  add  up  all  the  partial  simis  from  the 
different  parallel  subcircuits.  The  number  of  interconnection  levels 
depends  on  the  depth  of  the  nested  BBDF.  Fig.  20  shows  the  overall 
configuration  of  the  LU  factorization  network. 

The  number  of  processors  used  are: 

Basic  Level:  (m  +  d*i)2 
Interconnect  1:  (m  +  (d-l)*i)^ 

Interconnect  2:  (m  +  (d-2)*i)^ 

Interconnect  d:  (m  +  i) 

i  =  size  of  interconnection  network  (border) 
d  ■  depth  of  the  nested  BBDF 
m  «  size  of  the  subcircuits 

Hwang's  scheme  for  LU  factorization  is  also  modified  and  then 
applied  to  the  nested  BBDF  matrix.  Both  LU  decomposition  and  forward 
substitution  are  very  easily  combined  together  into  the  same  mesh 
connected  network.  The  configuration  and  the  input-output  timings 
for  a  4  x  4  matrix  are  shown  in  Fig.  21. 

To  solve  A  >  LU  and  Lz  *  y,  the  y  vector  is  put  in  as  an  extra 
column  in  the  A  matrix.  Then  the  elements  of  the  matrix  are  fed  into 
the  array,  one  column  at  a  time.  A  '1'  is  fed  in  at  the  diagonal 


input  position  every  other  clock  poise  so  that  s  delay  can  be  pot  in 

between  two  inpot  colomns  withoot  the  eleaents  being  changed  in  the 

network.  The  L  factors  then  coae  oot  on  the  opper  edge  and  the  U 

factors  f ran  the  left  edge.  Asynchronoos  handshaking  is  also  applied 

here,  except  that  the  partial  sons  are  taken  oot  right  before  they 

enter  the  division  cells  on  the  top  row  and  along  the  U  ootpot 

latches  on  the  left  edge.  The  partial  sobs  of  all  the  sobeircoits 

within  each  level  are  then  added  to  the  matrix  inpot  eleaents.  The 

resnlts  are  the  inpots  to  the  next  level.  The  dark  lines  in  between 

« 

the  cells  are  interface  latches  to  hold  inpnts  and  ootpots  ontil  the 
next  clock  poise  arrives.  Note  also  that  the  systea  clock  aost  be 
long  enoogh  to  accoaaodate  for  the  time  it  takes  to  do  the  coapnta- 
tion  or  the  time  it  takes  to  shift  the  inpot  data  to  the  ootpot.  The 
inpot  and  ootpot  timings  for  a  30  x  30  nested  BBDF  matrix  are  shown 
in  Figs.  22  and  23.  respectively. 

For  each  mesh  connected  array  for  LU  de compost ion  (Kai  Hwang [10]): 

2 

Noaber  of  reqnired  arithmetic  cells  -  n  -  n 
Noaber  of  I/O  terainal  »  4n  -  2 

Startop  time  del ay (time  delay  to  get  the  first  ootpot)  »  n  -  1 
Array  net  compote  time  ■  2n  -  1 
Total  coapote  time  ■  3n  -  2 

If  forward  sobstitotion  is  incloded  in  the  ssae  processor  array, 
together  with  LU  decoaposi tion,  the  following  modifications  occor. 

For  each  aesh  connected  array: 

Noaber  of  reqoired  arithmetic  cells  ■  (n+1) (n-1) 
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Fig.  22.  Input  timing  for  LU  decomposition  and  forward  substitution 
using  the  modified  Hwang  method. 
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Number  of  I/O  terminal  »  4n  -  1 
Startup  tiae  delay  -  n 
array  net  compute  tiae  “  2n 
Total  coapute  tiae  *  3n 

The  systea  configuration  is  shown  in  Fig.  24  and  the  interconr 
nection  for  a  7  z  5  array  to  a  5  x  3  array  is  shown  in  Fig.  25. 
Backward  substitution  uses  a  linear  systolic  array  (Fig.  26).  Solu¬ 
tions  are  obtained  from  the  bottom  of  the  matrix  up  to  the  top. 

Also,  the  first  computation  can  only  be  done  after  the  solutions  of 
< 

the  forward  substitution  are  obtained.  The  input  and  output  timings 
are  given  in  Fig.  27. 

From  the  results,  it  can  be  concluded  that  the  input  and  output 
timings  of  this  modified  systolic  scheme  are  more  regular  and  easier 
to  control  than  the  hexagonal  systolic  array.  Also,  fewer  processors 
and  less  time  are  needed  to  accomplish  both  LU  decomposition  and  for¬ 
ward  substitution.  Thus,  in  the  following  section  the  modified  Hwang 
implementation  is  studied  for  its  performance  time. 
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Fig.  26.  Linear  processor  array 


for  backward  substitution  (Ux*=z) 


1.4.  Performance  Evaluation 


Timing  Analysis  of  the  Nested  Algorithm  using  the  modified  Hwang 
implementation  for  the  example  nested  matrix: 


Time 

Start  Input 

Get  LU  factors 

Get  first  z 

Basic  Level 

tl 

dxi+p  ■  pi 

tl+2pl 

Interconnect  1 

3pl-2(dxi>+2 

t ( inti ) +( d-1 ) xi+ i 

t(intl)+2p2 

-  t(intl) 

(d-l)xi+i  -  p2 

Interconnect  2 

t ( inti) +3p2-2 ( d“l) i+2 

t  ( int2 )  +( d-2)  xi+  i 

t(int2)42p3-l 

*  t ( int2 ) 

(d-2)xi+i  -  p3 

Interconnect  3 

t(int2)+3p3-2(d-2)i+2 

t(int3)+i 

t  (  int3  )  +2  i— 1 

-  t(int3) 

T(LD*FWD)  »  3  (d*i  +  p)  +  5i  +  6 
n  -  2d  x  i  +  dxi 

Backward  substitution  *  3  •  2d  -  1 
Total  time  -  3*<d*i  +  p)  +5i+5  +  3*2d 

Assume  processors  are  very  cheap  so  that  the  computation  time  is 
the  main  consideration  of  performance.  Note  that  this  network  can 
compute  LU  factorization  in  0(n)  time.  It  is  thus  a  very  efficient 
processing  array  for  LD  factorization. 


2 


(XAPIEK  4 

DISECT  1CTH0D  -  OVERALL  SZS1EM  CONFIGURATION 


There  are  tiro  main  methods  for  solving  large  systems  of  linear 
equations,  the  direct  and  the  indirect  method.  The  direct  method,  as 
the  name  implies,  feeds  the  entire  matrix  into  the  processor  array  in 
one  pass,  and  solves  the  system  of  linear  equationa.  The  indirect 
method  will  be  discussed  in  the  next  chapter. 

The  system  for  the  direct  method  of  LU  factorixation  consists  of 
the  host  interface,  the  main  memory(MM),  a  control  unit(CU),  the 
memory  management  unit(MMU),  the  sequencer  unit(SU),  the  feedback 

buffer (FB) ,  the  input  processor  switch  (IPS)  and  the  processor 

array(PA).  The  block  diagram  of  the  system  is  shown  in  Fig.  28. 

The  host  interface  communi ca te s  with  the  host  computer  to  get 
the  commands  and  data  to  process  the  LU  factorization  operation.  The 
matrices  are  first  reordered  by  reordering  programs  implementing  the 
nested  clustering  algorithm  in  the  host  computer.  The  matrices  are 
then  represented  in  partitioned  bordered  block  diagonal  form.  The 
non-zero  values  are  then  stored  with  their  row  and  column  coordi¬ 
nates.  Since  the  amount  of  data  is  large  compared  to  the  operation 

set,  the  control  signals  are  embedded  into  the  data  stream.  Also, 
the  host  interface  recognizes  only  partitioned  matrices  and  vectors, 
so  the  host  provides  a  data  separator  for  each  submatrix  and  vector 
segment  in  the  data  stream.  The  host  interface  then  generates  a 
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proper  data  representation  each  time  a  separator  is  encountered. 
This  data  representation  and  the  instruction  words  are  then  sent  to 
the  CD  while  data  is  sent  to  the  main  memory.  The  host  interface 
tries  to  allocate  potentially  parallel  data  objects  into  different 
parts  of  the  memory  hierarchy  to  be  fetched  through  different  ports. 

The  control  unit  (CD)  decodes  the  host  interface  instructions  on 
matrices  into  sequencer  instructions  on  submatrices  and  vector  seg¬ 
ments.  Thus  each  task  requested  by  the  host  is  partitioned  into 
several  subtasks  which  may  be  carried  out  in  parallel.  Each  subtask 
is  carried  out  by  the  processor  array  under  supervision  of  the 
sequencer.  The  CD  thus  contains  information  on  the  number  of  nested 
levels  of  paralle  submatrices  as  well  as  the  sizes  of  the  subma¬ 
trices.  The  control  unit  also  sends  commands  to  the  input  processor 
switch  to  enable  and  disable  processors  depending  on  the  size  of  the 
matrix  to  be  processed.  It  also  controls  the  feedback  buffer  which 
organizes  the  output  solutions  from  the  processor  array. 

The  memory  management  unit  (MMD)  manages  the  main  memory  system. 
A  virtual  addressing  scheme  is  used  because  the  amount  of  data  of 
each  operand  matrix  can  be  very  large.  The  operand  address  received 
from  the  CD  is  the  starting  virtual  address  of  a  submatrix  or  vector 
segment.  The  MMD  computes  the  ending  virtual  address  from  it  and 
translates  both  virtual  addresses  to  physical  addresses.  The  MMD  is 
required  to  fetch  data  for  concurrent  tasks  from  the  multiport 
memory.  The  host  interface  would  try  to  allocate  potentially  paral¬ 
lel  data  objects  into  different  parts  of  the  memory  hierarchy  to  be 
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fetched  through  different  ports. 

The  sequencer  unit  (SU)  adds  different  delays  to  data  loaded  by 
the  MMJ  before  sending  then  to  different  processor  rows  or  colcmns 
depending  on  the  size  of  the  subnatrices.  The  delays  are  synchron¬ 
ized  by  s  system  clock  which  controls  the  processor  array. 

The  input  processor  switch  (IPS)  receives  instructions  from  the 
control  unit  to  deteznine  which  processor  elements  are  enabled 
according  to  the  size  of  the  matrices.  Then  the  input  data  are  then 
automatically  switched  from  the  SC  to  the  appropriate  processing  ele¬ 
ment. 

The  feedback  buffer  (FB)  is  nsed  when  the  data  output  from  the 
processor  array  is  re-sent  to  the  processor  array.  It  also  arranges 
the  outputs  in  an  organized  fashion  to  be  sent  to  the  MMJ  for  virtual 
address  translation. 

The  processor  array  (PA)  has  been  discussed  in  the  previous 
chapter.  The  array  is  controlled  by  instructions  from  the  CC  as 
well.  The  processor  array  shown  is  the  semi- synchronous  implementa¬ 
tion  for  the  nested  BBDF  matrices.  The  processor  array  is  comprised 
of  planes  of  processing  elements;  each  subsequent  plane  takes  care 
of  the  interconnection  between  two  previously  parallel  arrays  for  LU 
factorization  and  forward  substitution.  The  results  are  then  fed 
into  processing  elements  used  for  backward  substitution. 

This  chapter  gives  the  general  description  of  each  of  the  blocks 
in  the  special  purpose  machines  used  for  LU  factorization  in  the 
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INDISECT  METHOD 


An  alternative  method  for  solving  large  systems  of  linear  equa¬ 
tions  is  an  indirect  method*  such  as  the  Gauss- Seidel  [16]  relaxation 
method.  Special  purpose  hardware  can  also  be  built  to  accomplish 
this  algorithm  instead  of  having  it  done  in  software. 

The  block  diagram  of  snch  a  system  is  shown  in  Fig.  29.  The 
indirect  method  first  makes  a  guess  on  the  values  of  the  interconnec¬ 
tion  nodes.  The  partitioned  submatrices  from  1  to  k  can  then  be 
evaluated  using  LU  decomposition,  forward  and  backward  substitution. 
The  nested  clustering  algorithm  for  partitioning  matrices  is  no 
longer  needed.  The  matrix  can  be  partitioned  in  normal  BBDF  form. 
The  following  equations  are  solved  in  parallel: 

(L1  *  Dl)  X1  "  *1  ~  P1  *  xt 


(Lk  *  Uk)  xk  “  yk  "  Pk  *  xt 

where  is  the  initial  guess.  Then,  the  solutions  x^  to  x^  are  used 
to  compute  the  values  of  the  interconnection  node. 

-  jrt  -  Zih  nT  It 

After  the  new  values  of  the  interconnection  node  are  computed,  they 
are  compared  to  the  previous  guess.  If  the  solutions  are  off  by  more 
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Fig.  29.  Block  diagm  for  the  indirect  aethod  hardware 


than  the  tolerance  allowed,  then  the  new  zt'a  are  need  to  re-evaluate 
the  values  of  x^  to  x^  until  the  solutiona  converge.  Latency  can  be 
explored  if  some  of  the  valnea  of  x^  to  x^  axe  within  the  tolerance 
allowed. 

The  difference  between  the  indirect  and  the  direct  method  ia 
that  the  anbmatricea  are  completely  decoupled  from  each  other  in  the 
indirect  method  becanae  an  initial  gneaa  for  the  valnea  of  xt,  the 
interconnection  node  valnea.  ia  made.  Since  each  anbmatrix  ia  con¬ 
nected  to  another  solely  by  the  interconnection  nodes,  decoupling  as 
a  result  of  gnesses.  solves  the  problem  of  the  complexity  of  control¬ 
ling  the  interconnection.  However,  vector  multipliers  and  subtrac¬ 
tors  must  be  added  to  Compute  the  solutions  .  A  processing  element 
for  LU  decomposition  can  be  used  to  find  y^  -  •  xt,  whereas  a  for¬ 
ward  substitution  array  made  of  the  same  processing  elements  can  be 

_  k  r 

used  to  obtain  yt  -  *  x^.  In  addition,  a  comparator  must  be 
used  to  compare  the  values  of  x^  with  the  previous  iteration  to  check 
for  convergence. 

The  direct  and  indirect  method  have  different  appli cations.  If 
the  circuit  is  approximated  with  a  simplified  model,  e.g.,  in  RELAX 
and  SPLICE,  the  indirect  method  is  faster  because  the  solutions  con¬ 
verge  quickly.  However,  there  are  inaccuracies  in  the  solutions.  In 
most  cases,  these  inaccuracies  are  trivial.  The  indirect  method  con¬ 
verges  quickly  for  circuits  with  very  few  couplings.  However,  if  a 
detailed  model  of  the  circuit  or  some  complex  circuits  such  as  analog 


circuits  or  digital  circuits  with  parasitics*  are  to  be  analyzed* 
then  the  direct  method  ,  e. g. *  the  one  uaed  in  SPICE  [18] *  is  faster 
and  more  accurate.  Using  the  indirect  method  for  this  application 
may  lead  to  solutions  that  do  not  converge.  Thus*  both  the  direct 
and  indirect  method  are  used  depending  on  the  types  of  circuits  to  be 


solved. 
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CHARES  6 

CONCLUSIONS  AND  FUTUBE  1ESEAHG1 


This  thesis  is  s  prel in  inary  investigation  into  s  computer 
architecture  to  be  used  for  solving  large  systems  of  partitioned 
sparse  matrices  repesenting  the  connections  of  electrical  circuits. 
A  highly  concurrent  parallel  architecture  is  proposed.  It  is 
expected  to  be  a  powerful  tool  which  is  aimed  at  speeding  up  the  LU 
factorization  of  these  matrices  in  order  to  solve  the  equations. 

A  special  purpose  architecture  is  chosen  over  a  general  purpose 
architecture.  There  are  many  advantages  and  disadvantages  of  both 
types  of  architectures.  However,  in  spite  of  the  inextensibility  of 
the  special  purpose  architecture,  the  special  purpose  architecture  is 
simpler  to  design  and  is  sufficient  for  our  application.  This  type 
of  architecture  can  offer  a  faster  speedup  because  no  time  is  wasted 
in  decoding  instructions.  Three  different  types  of  special  purpose 
array  processors  have  been  studied  and  compared:  the  systolic  array, 
the  Hwang  processor  array  and  the  wavefront  array  processors. 

In  order  to  achieve  maximum  concurrency,  the  matrix  is  best 
ordered  in  a  nested  bordered  block  diagonal  form.  A  modified  clus¬ 
tering  algorithm  is  presented  based  on  a  heuristic  method  of  ordering 
these  sparse  matrices.  However,  using  this  algorithm,  the  number  of 
clusters  (or  submatrices)  that  result  are  always  in  powers  of  two. 


The  LU  faetors  at  the  borders  of  these  nested  BBDF  matrices  sre 
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easier  to  solve  then  the  normal  BBDF.  This  is  because  the  intercon¬ 
nection  nodes  are  also  decoupled  using  this  method,  thus  making  the 
borders  of  the  submatrices  a  small  size.  Separate  processor  arrays 
can  be  used  to  evaluate  the  LU  factors  as  well  as  performing  the  for¬ 
ward  substitution  of  all  the  submatrices  at  the  lowest  level.  The 
results  of  two  different  submatrices  can  then  be  summed  together  and 
become  input  to  the  interconnection  processor  array  in  the  next  level 
until  the  entire  matrix  is  computed.  Backward  substitution  can  then 
be  applied  to  the  resulting  matrix. 

In  order  to  facilitate  VLSI  implementation  to  reduce  cost  and 
computation  time  as  well  as  to  simplify  control,  a  highly  modular 
computing  structure  with  local  communications  is  probably  the  best 
strategy.  All  of  the  three  processor  arrays  mentioned  above,  namely, 
the  systolic  array,  the  Hwang  procesaor  array  and  the  wavefront  array 
processors  have  these  properties.  The  systolic  array  uses  a  synchro¬ 
nous  data  flow.  The  processing  elements  are  all  identical  with  local 
interconnections.  The  Hwang  processor  array  uses  latches  at  the 
border  to  latch  inputs.  The  wavefront  array  processors  use  an  asyn¬ 
chronous  handshaking  scheme  for  local  communi cations.  The  synchro¬ 
nous  scheme  creates  problems  as  the  clock  skew  grows  due  to  the 
increase  in  the  size  of  the  processor  array.  However,  the  asynchro¬ 
nous  scheme  may  cause  race  conditions  and  data  conflicts  if  the  data 
is  not  synchronized  correctly.  The  Hwang  processor  array  makes  max¬ 
imum  use  of  processors;  however,  latches  must  be  used  at  the  border. 
This  destroys  part  of  the  simplicity  and  regularity  of  the  data  con- 


trol.  To  compromise  between  ell  the  tradeoffs  of  these  architec¬ 
tures,  a  modified  echeme  is  designed  to  incorporate  the  characteris¬ 
tics  of  all  these  types  of  processor  arrays.  The  processor  array 
proposed  uses  a  synchronous  data  flow  to  input  and  output  data  at 
each  submatrix.  A  system  clock  is  used  to  synchronize  the  data. 
However,  certain  handshaking  signals  (input  and  output  valid  flags) 
are  also  used  so  that  the  computations  will  not  be  done  unless  these 
flags  are  present.  This  scheme  essentially  adds  in  wait  states  and 
is  useful  when  the  outputs  flow  from  one  level  to  the  next  intercon¬ 
nection  level.  Several  problems  ,  however,  may  arise.  Race  condi¬ 
tions  and  conflicts  may  occur.  The  data  must  be  able  to  be  latched 
in  each  processing  element  long  enough  before  all  the  valid  signals 
are  present.  Also,  the  input  data  must  be  monitored  so  that  data 

will  not  be  overwritten  while  waiting  for  the  valid  signals. 

: 

The  Hwang  VLSI  structures  are  also  included  in  the  processor 
array  because  latches  are  used  to  buffer  the  data  instead  of  using 
registers  inside  the  processing  elements.  This  implementation  offers 
maximum  processor  utilization.  In  addition,  the  inputs  are  fed  in 
column  by  column  and  are  thus  easily  controlled.  The  processor  array 
proposed  can  solve  LU  decomposition  and  forward  substitution  in  0(n) 
time.  Backward  substitution  is  implemented  by  a  linear  array  which 
can  also  compute  in  0(n)  time. 

There  are  two  methoda  for  solving  a  large  system  of  linear 
equations,  the  direct  method  and  the  indirect  method.  The  direct 
method  solves  the  entire  matrix  in  one  pass.  The  indirect  method 
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guesses  st  the  values  of  the  interconnection  nodes  initially  to  solve 
for  the  rest  of  the  circuit.  Then  the  values  of  the  interconnection 
nodes  are  updated  and  the  vhole  process  is  repeated  until  the  solu¬ 
tions  converge.  It  is  still  unclear  which  method  is  actually  better 
in  terms  of  speed.  The  performance  of  the  indirect  method  depends  on 
how  good  the  initial  guess  is  so  that  the  solutions  converge  quickly 
provided  they  converge.  A  general  description  of  the  overall  system 
configuration  of  both  methods  has  been  given.  This  thesis  deals  with 
the  processor  array  in  detail.  The  design  of  a  single  processor  ele¬ 
ment  is  also  given  in  the  Appendix.  However,  the  hardware  and  the 
software  controls  of  the  rest  of  the  system  have  yet  to  be  designed 
in  detail. 

A  simulator  will  be  written  to  simulate  the  data  flow  between 
all  submatrices  and  their  interconnection  levels.  The  results  from 
this  simulator  give  the  time  when  data  is  input  and  output  at  all  the 
processing  elements.  These  results  are  useful  for  monitoring  the 
control  of  the  data  for  any  size  of  matrix.  A  hardware  descriptive 
language  [17],  which  implements  the  basic  instruction  set  needed  to 
control  the  processing  elements,  must  be  defined.  The  language  can 
be  similar  to  the  wavefront-oriented  language  by  S.Y.Kung  [13],  but 
it  should  be  more  special  purpose  and  simple  to  design.  Another 
simulator  can  then  be  written  to  describe  the  subroutines  used  to 
synchronize  the  processing  elements.  This  simulator  describes  the 
sequence  of  instructions  that  each  processing  element  recieves.  Data 
can  be  fetched  from  and  flowed  to  any  adjacent  cells  in  the  up,  down. 


[17]  If  every  occurrence  of  a  FLOf  <to  direction)  instruction  of  a 
data  sourcing  HJ  is  matched  by  an  occurence  of  a  FETCH  <from  oppo¬ 
site  direction)  instruction  in  the  instruction  sequence  of  the 
appropriate  recieving  PE,  i.e.,  every  FLOW  is  matched  both  in  number 
of  occurences  snd  sequencing  of  appearance  to  a  FETCH  in  the  same 
phase.  Data  bus  contention  problems  must  also  be  solved.  The  nested 
clustering  algorithm  can  also  be  implemented  to  reorder  the  input 
matrices.  Uneven  depths  of  clusters  can  be  incorporated  as  veil. 

In  terms  of  hardware,  the  processor  arrays  can  be  designed  so 
that  they  can  deal  with  matrices  of  any  arbitrary  size,  either  larger 
or  smaller  than  the  hardware  can  accommodate.  Larger  matrices  need 
to  be  partitioned  into  smaller  matrices  while  smaller  matrices  are 
patched  with  l's  in  the  diagonal  and  0's  in  the  rest  of  the  rows  and 
columns.  Once  the  software  is  finished,  the  details  of  all  the  other 
blocks  in  the  system  configuration  can  be  implemented.  A  floating 
point  processing  element  can  be  designed  in  CMOS  so  that  the  array 
can  be  used  for  circuit  simulation.  The  CMOS  cell  will  contain  24 
bits  for  the  mantissa  snd  8  bits  for  the  exponent.  Local  memory, 
e.g. ,  a  RAM,  can  be  added  into  each  processing  element.  This  reduces 
the  memory  access  time  and  can  thus  yield  higher  performance. 

Fast  simulation  engines  can  be  the  key  to  improving  circuit 
simulation  time  once  a  good  algorithm  has  been  developed.  LU  factor¬ 
ization  is  one  of  the  time-consuming  loops  in  simulation  programs. 


APPENDIX:  DESCRIPTION  OF  A  PROCESSING  ELEMENT 


A  16-bit  fixed  point  processing  element  (PE)  has  been  designed 
nsing  a  3nm  CMOS  VLSI  technology.  The  PE  accepts  16-bit  operands, 
with  8  bits  representing  the  fractional  part  and  8  bits  representing 
the  integer  part.  The  four  major  functions  that  it  performs  are: 
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Function  (1)  performs  the  update  of  the  matrix  elements  during 
LD  factorization.  Function  (2)  solves  for  the  L  (lover  triangular) 
factors  while  function  (3)  solves  for  the  U  (upper  triangular)  fac¬ 
tors.  Function  (4)  is  used  when  addition  is  needed  at  the  intercon¬ 
nection  level. 

The  cell  basically  consists  of  a  fixed-point  adder,  a  multi¬ 
plier,  a  divider,  a  function  decode,  several  muxes  and  latches.  The 
pin  specifications  of  the  chip  are  given  in  Fig.  30.  The  functional 
block  diagram  of  the  cell  is  shown  in  Fig.  31.  The  cell  is  designed 
in  CMOS  because  of  low  power  dissipation  which  is  important  in  an 
array  of  large  numbers  of  processing  elements. 

The  design  of  the  cell  is  semi- synchronous,  level  sensitive  and 
combinational.  It  is  semi- synchronous  because  Avin,  Lvin  and  Uvin 
are  flag  signals  to  enable  the  computation  (i.e.,  the  addition  and 
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Total  Number,  of  Pins  *  55  pins 
Pin  Assignment: 

a  —  a<k' 

3:n  3ij 

—  Input/Output  Pin  (8  bits) 
avin~  ai(jkl  Valid 
—  Input  Pin 
U  -  U  (k) 

in  wkj 


-  Input  (8  bits) 
Uv,„-  U Valid 

—  Input 
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u'n  'ik 

—  Input  (8  bits) 
U,n-  [lk)  Valid 

-  Input 


Aout  =  Ain  —  LinUin 
Lout  =  Ain/Uin 
Uout  *  Ain 

Aout  =  Ain  +  Lin  +  Uin 
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01 (Phi) 
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Cycle  2(CY2) 
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Fig.  30.  Pin  assignment  of  a  CMOS  proce&aing  element. 
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Fig.  31.  Logie  block  diagram  for  the  CMOS  processing  element 
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multiplication)  when  the  system  clock  goes  high.  It  is  level  sensi¬ 
tive  because  it  responds  to  the  logic  level  of  the  clock  instead  of 
the  clock  edge.  TVo  cycles  of  a  two-phase  clock  are  used  for  each 
computation  of  a  processing  element.  It  is  combinational  because  no 
feedback  path  exists  within  the  cell  so  that  a  finite  state  machine 
is  not  needed. 

The  inputs  of  A,  L  and  U  are  time-multiplexed  into  the  circuit 
because  of  the  limitations  on  the  nmaber  of  I/O  pins.  This  scheme, 
however,  slows  down  the  fetching  of  operands.  t  The  entire  operation, 
fetching  of  data  and  computation  takes  two  cycles.  The  timing 
diagram  of  the  control  signals  and  the  data  is  shown  in  Fig.  32. 
During  the.  first  cycle  (CY1) ,  phase  1  latches  the  least  significant 
bit  LSB  data  (Ain,  Lin,  Din,  Avin,  Lvin,  Uvin)  from  the  temporary 
latches  while  the  most  significant  bit  MSB  data  are  latched  in  phase 
2.  The  LSB  and  MSB  temporary  latches  are  loaded  in  01  and  02  of  the 
previous  cycle  2.  respectively,  from  the  output  of  an  identical  adja¬ 
cent  processing  element.  Also,  the  input  Ain  and  output  Aout  share 
the  same  I/O  pins  because  of  pin  limitation  and  the  fact  that  the 
next  inputs  are  streamed  from  another  chip  at  one  or  more  clock 
cycles  after  the  outputs  are  generated  in  this  chip.  Thus,  no  I/O 
conflicts  should  occur  if  data  are  properly  synchronized  in  the  pro¬ 
cessor  array.  Lin  and  Uin,  on  the  other  hand,  cannot  share  the  pins 
with  Lout  and  Uout  because  data  can  be  input  and  output  at  the  same 
time  during  an  operation. 
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A  domino  CMOS  design  (Fig.  33) is  used  for  the  function  decode 
instead  of  the  conventional  dual  NAND  and  NOR  gates  because  it  is 
simpler  to  design  and  takes  up  less  area.  The  function  decode  is 
precharged  during  01  and  CY1  and  the  function  is  evaluated  when  Pi 
and  CY1  go  low,  i.e. ,  when  P2  and  CY1  are  high.  Then  SEL1  and  SELO 
must  be  held  at  the  same  voltage  until  the  next  pi  and  CY1 ,  otherwise 
the  function  selected  will  be  changed  during  this  operation.  The 
computation  is  somewhat  pipelined  by  two  different  pipes. 

If  a  multiply- addition  (FO)  is  selected  the  multiplier  outputs 
are  enabled  during  02  and  CY1.  As  soon  as  the  MSB  of  the  data  get 
latched,  the  computation  will  be  done  right  before  CT1  goes  low. 
Then  the  adder  is  enabled  at  01  and  CT2  and' results  are  latched  in 
the  output  multiplexer  (mux).  The  LSB  of  the  output  is  generated  at 
01  and  CT2  and  the  MSB  of  the  output  at  02  and  CT2.  If  a  division  is 
selected  (FI) ,  the  reciprocator  output  is  enabled  during  02  and  CT1 
and  the  results  are  processed  during  01  and  CY2.  Then  the  results 
are  carried  to  the  L  latches. 

No  results  from  the  adder,  multiplier  and  reciprocator  are 
latched  if  the  valid  signals  of  all  (A,  L,  and  D)  have  not  arrived. 
After  they  arrive,  the  valid  signals  being  passed  (as  output)  to  the 
next  cell  are  enabled  during  cycle  2. 

Parity  and  overflow  are  both  generated  by  the  multiplier  and 
adder,  and  when  reciprocation  is  done,  the  parity  and  overflow  of  the 
multiplier  is  output  whereas  those  of  the  adder  are  normally  being 
generated. 


fO  *SEL1  SEL0  01  *CY1  High  Precharges  the  Nodes  While 
fl  »SEL1  SELO  01 'CY1  Evaluates 
f2  -SEL1SEL0 
f3  -SEL1SEL0 


Fig.  33.  Function  decoder  with.  domino  OIOS  design. 
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Two  schemes  were  considered  for  the  design  of  the  adder.  s 
ripple-carry  and  a  carry- look-ahead.  Ripple-carry  is  extremely  slow 
because  the  carry  has  to  propagate  to  the  previous  full  adder  before 
the  full  adder  can  perform  the  addition.  The  adder  on  the  chip  has 
the  following  functions: 

(i)  Ain  +  Lin  +  Uin 

(ii)  Ain  -  Lin  *  Uin 

Thus  a  three- ope rands  to  two-operands  adder  is  put  in  front  of  a 
carry-look-ahead  adder.  The  carry- look-ahead  adder  is  shown  in  Fig. 
34.  The  parity  and  overflow  are  also  generated. 

The  input  muxes  to  the  multiplier  are  enabled  so  that  when  the 
function  is  FO,  Uin  *  Lin  is  perfozsed  while  Uin-1  •  Ain  is  performed 
when  the  function  is  FI.  The  input  muxes  to  the  adder  are  enabled 
such  that  when  the  function  is  not  F3,  then  Ain  -  Lin  *  Uin  is  per¬ 
formed  except  for  F2  when  the  output  of  the  adder  is  not  used.  Dur¬ 
ing  F3.  Ain  +  Lin  +  Uin  is  performed  and  the  inputs  are  chosen  as 
appropriate. 

The  L  output  is  valid  when  the  function  is  FI  and  Avin.  Uvin 
and  Lvin  are  asserted  and  the  clock  cycle  is  cycle  2;  or  function  is 
not  FI  and  Lvin  is  valid  and  the  clock  cycle  is  cycle  2.  The  U  out¬ 
put  is  valid  when  function  is  F2  and  Avin.  Uvin  and  Lvin  are  asserted 
and  the  clock  cycle  is  cycle  two;  or  the  function  is  not  F2,  but  Uvin 
and  cycle  2  are  high.  The  A  output  is  valid  when  Avin  and  Lvin  and 
Uvin  are  high  and  the  clock  cycle  is  cycle  2. 


ahead  adder 


Hie  reciprocator  1*  based  on.  a  non-restoring  division  principle 
[23] .  The  block  diagram  of  the  reciprocator  and  a  basic  cell  in  the 
reciprocator  are  shown  in  Figs.  35  and  3d.  respectively.  Each  cell 
performs  the  functions: 

z  -  x  xor  [a  (  y  xor  t)] 
u  ■  (x  xor  b)  (y  +  t)  +  yt 

The  multiplier  (Fig.  37)  is  a  two's  complement  combinational 

array  multiplier  based  on  Booth's  algorithm  [23].  The  main  cell  is 
basically  the  same  as  the  reciprocator  cell.  An  extra  colmn  of  cell 
is  added  to  the  16  x  16  cells  to  generate  the  true  sign  of  the  par¬ 
tial  products.  Also,  the  last  cell  of  this  colmn  can  be  xor'ed  with 

the  MSB  of  the  16  x  16  multiplier  result  to  generate  the  overflow 
check. 

A  two-phase  clock  is  used  with  two  cycles  in  each  phase.  Phase 
one  must  be  long  enough  for  both  the  adder  to  finish  addition  or  for 
the  multiplier  to  finish  multiplication  and  also  for  the  output  latch 
to  latch  in  the  LSB.  Phase  two  must  be  long  enough  for  both  tne 
input  latch  to  latch  the  MSB  or  the  output  latch  to  latch  the  MSB  and 
also  for  the  reciprocator  to  finish  calculation. 

Note  that  the  multiplier  scheme  using  Booth's  algorithm  calcu¬ 
lates  in  moderate  speed  and  it  takes  up  a  lot  of  area.  A  faster  mul¬ 
tiplier  can  be  implemented.  In  addition,  reciprocation  and  then  mul¬ 
tiplication  can  be  replaced  by  just  a  divide  using  a  divider  so  that 
the  calculation  needs  only  go  through  one  stage.  Also,  p\  can  be 
shortened  too.  A  BAM  can  also  be  added  so  that  intermediate  results 


Fig.  35.  The  reciprocator 
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can  be  stored  in  local  meaory  and  can  shorten  memory  access  time.  In 
addition,  a  ROM  can  also  be  added  to  store  the  instruction  sequences 
of  each  proceasing  element.  A  floating  point  procesaing  element  can 
be  designed  using  the  same  functional  control  described  in  this 
appendix. 
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